AllLife Bank Customer Segmentation

Objective

To identify different segments in the existing customer base, based on their spending patterns as well as past interactions with the bank, using clustering algorithms, and to provide recommendations to the bank on how to better market to and serve these customers.

Data Description

The data provided is of various customers of a bank: their financial attributes, such as credit limit and the total number of credit cards held, and the different channels through which they have contacted the bank with queries (visiting a branch, online, and through a call center).

Data Dictionary

Libraries

Loading the Dataset

Dataset Summary

Shape of the data

Missing Values

There are no missing values in the dataset

Duplicate data

There are no duplicate records
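The two checks above can be sketched as follows; a toy dataframe stands in for the real AllLife Bank data:

```python
import pandas as pd

# Toy stand-in for the loaded dataframe; the real notebook runs these
# checks on the full AllLife Bank data.
data = pd.DataFrame({
    "Avg_Credit_Limit": [100000, 50000, 30000],
    "Total_Credit_Cards": [2, 3, 7],
})

n_missing = data.isnull().sum().sum()    # total missing cells
n_duplicates = data.duplicated().sum()   # fully duplicated rows
print(n_missing, n_duplicates)
```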

Sample records

Data Types

All the attributes are integers

Checking keys

There is only one possible attribute for a key, the Customer Key. We'll first convert Sl_No to the index, and then check the Customer Key. Since this data is to be used for customer segmentation, verifying the uniqueness of the customer key is essential.

I have already checked this in Excel, and I am depicting the same in the notebook. Given the small size of the dataset, it is convenient to check a few initial things in the Excel file first.

There appear to be 5 duplicate values in the Customer Key column
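A minimal sketch of this check, assuming columns named `Sl_No` and `Customer Key` (the rows below are illustrative, not actual records):

```python
import pandas as pd

# Illustrative rows only; the third row repeats the first row's key.
df = pd.DataFrame({
    "Sl_No": [1, 2, 3, 4],
    "Customer Key": [87073, 38414, 87073, 17341],
    "Avg_Credit_Limit": [100000, 50000, 30000, 20000],
})

# Use the serial number as the index, then count repeated keys.
df = df.set_index("Sl_No")
n_dup_keys = df["Customer Key"].duplicated().sum()
print(n_dup_keys)  # 1 repeated key in this toy example
```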

Analyzing the records with duplicate customer keys

The records for the same Customer Key look very different from each other. I am assuming this is either a mistake in the Customer Key assignment, or we are missing a current-version indicator in the dataset. For now, I am going to treat these as different customers. After the clustering, I will analyze the groups corresponding to these sets of records.

Standardizing Column Names

Column Statistics

Exploratory Data Analysis

Univariate Analysis

The first step of univariate analysis is to check the distribution/spread of the data, primarily using histograms and box plots. Additionally, we'll plot each numerical feature on a violin plot and a cumulative distribution plot. For these 4 kinds of plots, we build the summary() function below to plot each of the numerical attributes. We'll also display the feature-wise five-number summary.
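One possible shape for this helper, sketched with only matplotlib and pandas; the notebook's actual summary() implementation may differ:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def summary(series, name):
    """Plot histogram, box plot, violin plot and empirical CDF for one
    numeric feature, and print its five-number summary."""
    fig, axes = plt.subplots(1, 4, figsize=(16, 3))
    axes[0].hist(series, bins=20); axes[0].set_title(f"{name}: histogram")
    axes[1].boxplot(series);       axes[1].set_title("box plot")
    axes[2].violinplot(series);    axes[2].set_title("violin plot")
    axes[3].plot(np.sort(series), np.linspace(0, 1, len(series)))
    axes[3].set_title("empirical CDF")
    fig.tight_layout()
    print(series.describe()[["min", "25%", "50%", "75%", "max"]])
    return fig

# Toy right-skewed data standing in for Avg_Credit_Limit
rng = np.random.default_rng(0)
fig = summary(pd.Series(rng.lognormal(10, 1, 200)), "Avg_Credit_Limit")
```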

Summary of Average Credit Limit

The attribute is right skewed with a lot of outliers

Summary of Total Number of Credit Cards

The attribute is fairly normally distributed with a few spikes

Summary of Total Number of Visits to the Bank

The data is slightly right skewed

Summary of Total Online Visits

The data is right skewed and has some outliers to the right

Summary of Total Calls Made to the Bank

The data is right skewed

Labeled Bar-plots

Creating a function to plot labeled bar plots of the features, with percentage labels on the bars.

Creating credit card limit bins out of the available data in the avg credit limit feature
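The binning step can be sketched with `pd.cut`; the bin edges and labels below are hypothetical, not the notebook's actual cut points:

```python
import pandas as pd

# Hypothetical credit-limit values and bin edges for illustration.
limits = pd.Series([3000, 12000, 45000, 90000, 180000])
bins = pd.cut(
    limits,
    bins=[0, 10000, 50000, 100000, 200000],
    labels=["<10K", "10K-50K", "50K-100K", "100K-200K"],
)
print(bins.value_counts())
```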

Credit Card Limit Bins

Total Credit Cards

Total Visits to the Bank

Total Online Visits

Total Calls Made

Bi-variate Analysis

Pair Plot

Heatmap

Average Credit Limit distribution by Each of the Other Attributes

We can see clear segmentations with respect to each pair of features

Data Preprocessing

Before clustering, we should always scale the data: features on larger scales would otherwise dominate the distance calculations and receive unintentional importance.
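A minimal illustration with scikit-learn's StandardScaler, assuming z-score scaling (the common choice here):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (credit limit vs. card count);
# without scaling, Euclidean distance is dominated by the first column.
X = np.array([[100000.0, 2], [50000.0, 6], [30000.0, 4]])
X_scaled = StandardScaler().fit_transform(X)

# Each column now has mean 0 and unit variance.
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```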

K-means Clustering

K-means is often referred to as Lloyd's algorithm. In basic terms, the algorithm has three steps. The first step chooses the initial centroids, the most basic method being to pick samples from the dataset. After initialization, K-means loops between the two other steps. The first of these assigns each sample to its nearest centroid. The second creates new centroids by taking the mean of all the samples assigned to each previous centroid. The difference between the old and the new centroids is computed, and the algorithm repeats these last two steps until this value falls below a threshold. In other words, it repeats until the centroids no longer move significantly.
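The three steps above can be sketched as a minimal NumPy implementation (a toy illustration, not scikit-learn's optimized KMeans):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Minimal Lloyd's algorithm: init from samples, then alternate
    assignment and centroid-update steps until centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]      # step 1: init
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)                            # step 2: assign
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.linalg.norm(new - centroids) < tol:            # converged
            break
        centroids = new                                      # step 3: update
    return labels, centroids

# Two well-separated toy blobs
X = np.vstack([np.random.default_rng(1).normal(m, 0.3, (30, 2))
               for m in (0.0, 5.0)])
labels, centroids = lloyd_kmeans(X, k=2)
```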

Finding the best number of centroids (K)

Elbow Curve to get the right number of Clusters

A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k.

Appropriate value for k seems to be 3
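The elbow search can be sketched as follows; toy blobs stand in for the scaled bank features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated blobs standing in for the scaled data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.2, (40, 2)) for m in (0.0, 4.0, 8.0)])

# Inertia (within-cluster sum of squares) for k = 1..6; the "elbow"
# where the curve flattens suggests the number of clusters.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0)
               .fit(X).inertia_ for k in range(1, 7)}
for k, sse in inertias.items():
    print(k, round(sse, 1))
```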

Silhouette Scores

silhouette score = (p − q) / max(p, q)

p is the mean distance to the points in the nearest cluster that the data point is not a part of

q is the mean intra-cluster distance to all the points in its own cluster.

The value of the silhouette score ranges from -1 to 1.

A score closer to 1 indicates that the data point is very similar to other data points in the cluster,

A score closer to -1 indicates that the data point is not similar to the data points in its cluster.

The silhouette score for 3 clusters is the highest, so we will choose 3 as the value of k.
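The silhouette comparison can be sketched as follows, again with toy blobs standing in for the scaled data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated toy blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.2, (40, 2)) for m in (0.0, 4.0, 8.0)])

# Mean silhouette score for each candidate k; the maximum suggests k.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k)  # 3 for these three blobs
```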

Let's also visualize the silhouettes created by each of the clusters for two values of K, 3 and 4

Visualize the Silhouettes

Clearly, 3 clusters seem very reasonable

Build the model with 3 centroids

Add the cluster numbers as a new attribute in the dataset
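A sketch of fitting the 3-centroid model and attaching the labels; the data and column names here are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy scaled features; the notebook fits on the scaled bank data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.2, (40, 2)) for m in (0.0, 4.0, 8.0)])
df = pd.DataFrame(X, columns=["feat_1", "feat_2"])

# Fit K-means with 3 centroids and store the cluster label per row.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
df["KM_Cluster"] = km.fit_predict(X)
print(df["KM_Cluster"].value_counts())
```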

Customer Profiling - Visualize the Clusters with Features

It appears that the method of contacting the bank (in person / online / by phone) predominantly drives the clustering; we explore this in the plot below.

Clusters Obtained from K-Means technique:

Analyzing the segments using Box Plot

Hierarchical Clustering

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.

The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criteria determines the metric used for the merge strategy:

Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.

Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.

Average linkage minimizes the average of the distances between all observations of pairs of clusters.

Single linkage minimizes the distance between the closest observations of pairs of clusters.

Before starting clustering we'll remove the cluster column from the dataset.

I am going to try many distance metrics and linkage methods to find the best combination.

Cophenetic Correlations

The cophenetic correlation for a cluster tree is defined as the linear correlation coefficient between the cophenetic distances obtained from the tree, and the original distances (or dissimilarities) used to construct the tree. Thus, it is a measure of how faithfully the tree represents the dissimilarities among observations.

The cophenetic distance between two observations is represented in a dendrogram by the height of the link at which those two observations are first joined. That height is the distance between the two subclusters that are merged by that link.

The magnitude of this value should be very close to 1 for a high-quality solution. This measure can be used to compare alternative cluster solutions obtained using different algorithms.

The cophenetic correlation is maximum with Euclidean distance and Average Linkage.
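The combination search can be sketched with SciPy's linkage and cophenet on random toy data (Ward is excluded here since it is only defined for the Euclidean metric):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Cophenetic correlation for each metric/linkage combination.
results = {}
for metric in ("euclidean", "cityblock", "chebyshev"):
    dists = pdist(X, metric=metric)
    for method in ("single", "complete", "average"):
        Z = linkage(dists, method=method)
        c, _ = cophenet(Z, dists)
        results[(metric, method)] = c
best = max(results, key=results.get)
print(best, round(results[best], 3))
```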

Dendrograms

A dendrogram, in general, is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters.

The cophenetic correlation is highest for average linkage methods. 3 appears to be the appropriate number of clusters from the dendrogram for average linkage
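A dendrogram sketch for average linkage on toy blobs (rendered off-screen here; the real plot uses the scaled bank data):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Three toy blobs standing in for the scaled data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.2, (20, 2)) for m in (0.0, 4.0, 8.0)])

# Average-linkage dendrogram; cutting where the vertical gaps are
# largest (here, three branches) suggests k = 3.
Z = linkage(X, method="average", metric="euclidean")
fig, ax = plt.subplots(figsize=(10, 4))
dendrogram(Z, ax=ax, no_labels=True)
ax.set_ylabel("merge distance")
```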

Build Agglomerative Clustering model

Also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all the data.

Build Model

Assign cluster labels
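A sketch of the model build with scikit-learn's AgglomerativeClustering, using the average linkage selected above (toy blobs stand in for the scaled data):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three well-separated toy blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.2, (40, 2)) for m in (0.0, 4.0, 8.0)])

# Bottom-up clustering with average linkage (Euclidean by default),
# matching the combination chosen via cophenetic correlation.
hac = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = hac.fit_predict(X)
print(np.bincount(labels))  # cluster sizes
```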

Clusters Obtained from hierarchical technique:

Cluster profile

Analyzing the segments using Box Plot

Checking the clusters for the duplicated customer keys

If we treat the duplicate records as updated records for the same customers, then 3 of the 5 customers have actually changed clusters/groups. It appears that by increasing credit limits, or by converting customers to digital banking, we can move customers to a more desirable and profitable cluster.

Contact method:

A hypothesis I had going into this was that there would be three clusters by contact method, with customers sticking to their preferred channel for interacting with the bank (online, in person, or by phone). The 3D rotating scatter plot below shows this hypothesis was correct.

Actionable insights and Recommendations

There appear to be three distinct categories of customers:

Customer channel preferences should guide how the bank contacts them: online/phone users will probably prefer email/text notifications, while in-person users will prefer mail notifications and in-branch upselling.

Additionally, the phone and in-person customers should be contacted to promote online banking.